Regularization

Need for Regularization

Imagine a traditional classroom: we have quizzes and exams. Quizzes test your learning along the way, and exams validate the knowledge gained in the course as a whole. However, the ultimate goal of a course is to train a student to apply those concepts beyond the classroom, not just to score well on quizzes and exams. Similarly, we do not train a neural network merely to get amazing performance on the training set; we want the network to perform well on entirely new data. Compared to its performance on the training data, a network usually does not perform as well on new data.

I want to start the discussion with Occam's Razor, which suggests choosing the simplest model that works. Choosing a simple model for a neural network is difficult because it is inherently complex. A neural network learns the distribution of the data while training so that it can work on new data from the same distribution (test accuracy). There is usually a slight performance drop between training time and test time, and this drop is called the generalization error. In some cases, when training is too aggressive, the network starts memorizing and fitting the training data instead of learning the data distribution, which results in poor performance at test time. This is a result of poor generalizability of the network. We use a number of techniques to improve the generalization of a network, and these techniques are collectively called regularization. There are three main regularization techniques used in neural networks:

  1. Classic Regularization ($L^1$ and $L^2$)
  2. Dropouts
  3. Batch Normalization

Classic Regularization (Weight Norm Penalties):

Regularization helps us simplify our final model even with a complex architecture. One classic type of regularization is weight penalties, which keep the values of the weight vectors in check. We achieve this by adding the norm of the weight vector to the error function to get the final cost function. We can use any norm from $L^1$ to $L^\infty$; the most widely used norms are $L^2$ and $L^1$.

$L^2$ Regularization

$L^2$ regularization is also called Ridge Regression or Tikhonov regularization. Among the weight penalties, $L^2$ is the most widely used. $L^2$ regularization penalizes large weights. We achieve regularization by adding the square of the $L^2$ norm to the cost function. The mathematical representation of $L^2$ regularization is given by: $$Cost = E(X) + \lambda \parallel W \parallel_2 ^ 2$$ The new gradient g of the cost with respect to the weights W is given by: $$g = \frac{\partial E(X)}{\partial W} + 2 \lambda W$$

$\lambda$ is the regularization coefficient that can be used to control the level of regularization.
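
To make the update concrete, here is a minimal NumPy sketch (not YANN code) of the $L^2$ penalty; data_grad is a hypothetical stand-in for $\frac{\partial E(X)}{\partial W}$ produced by backpropagation, and the values are purely illustrative.

import numpy as np

W = np.array([0.8, -1.5, 0.05, 2.0])            # weight vector
data_grad = np.array([0.1, -0.2, 0.05, 0.3])    # hypothetical dE(X)/dW from backprop
lam = 0.01                                      # regularization coefficient lambda

cost_penalty = lam * np.sum(W ** 2)             # lambda * ||W||_2^2 added to the cost
grad = data_grad + 2 * lam * W                  # dE(X)/dW + 2 * lambda * W

Notice that the larger a weight is, the larger the extra gradient pushing it back toward zero, which is how the penalty keeps the weights in check.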

$L^1$ Regularization

In $L^1$ regularization we add the $L^1$ norm of the weight vector to the cost function. $L^1$ regularization penalizes weights that are not zero. It forces weights toward zero, as a result of which the final parameters are sparse, with most of the weights being zero. The mathematical representation of $L^1$ regularization is given by: $$Cost = E(X) + \lambda \parallel W \parallel_1$$ The new gradient g of the cost with respect to the weights W is given by: $$g = \frac{\partial E(X)}{\partial W} + \lambda \, sign(W)$$

Combination of Norm Penalties:

We do not have to restrict ourselves to one weight norm penalty for a parameter; we can use a combination of more than one weight penalty. Our final model is then shaped by the properties of all the regularizers. For example, if we use both the $L^1$ and $L^2$ weight penalties in our model, the cost function becomes $$Cost = E(X) + \lambda_2 \parallel W \parallel_2 ^ 2 + \lambda_1 \parallel W \parallel_1$$ The new gradient g of the cost with respect to the weight vector W is given by: $$g = \frac{\partial E(X)}{\partial W} + 2 \lambda_2 W + \lambda_1 \, sign(W) $$
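
Continuing the NumPy sketch from above (same hypothetical W and data_grad), the combined penalty simply adds both terms to the cost and to the gradient:

lam1, lam2 = 0.001, 0.01                        # L1 and L2 coefficients

cost_penalty = lam2 * np.sum(W ** 2) + lam1 * np.sum(np.abs(W))   # lambda2*||W||_2^2 + lambda1*||W||_1
grad = data_grad + 2 * lam2 * W + lam1 * np.sign(W)               # combined gradient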

Regularization by Norm Penalties in YANN:

YANN has the flexibility of regularizing a selected layer or the entire network. To regularize a layer, we set the following arguments in the network.add_layer() function:

regularize – True if you want to apply regularization, False if not.
regularizer – (l1_coeff, l2_coeff) coefficients for the L1 and L2 regularizers. Default is (0.001, 0.001).

To give common regularization parameters to the entire network, we can pass the regularization argument in the optimizer parameters:
"regularization"    : (l1_coeff, l2_coeff). Default is (0.001, 0.001)

Let's see Regularization in action:


In [1]:
from yann.network import network
from yann.utils.graph import draw_network
from yann.special.datasets import cook_mnist
def lenet5 ( dataset= None, verbose = 1, regularization = None ):             
    """
    This function is a demo example of lenet5 from the infamous paper by Yann LeCun. 
    This is an example code. You should study this code rather than merely run it.  
    
    Warning:
        This is not the exact implementation but a modern re-incarnation.

    Args: 
        dataset: Supply a dataset.    
        verbose: Similar to the rest of the toolbox.
        regularization: (l1_coeff, l2_coeff) regularization coefficients, passed on to the objective layer.
    """
    optimizer_params =  {        
                "momentum_type"       : 'nesterov',             
                "momentum_params"     : (0.65, 0.97, 30),      
                "optimizer_type"      : 'rmsprop',                
                "id"                  : "main"
                        }

    dataset_params  = {
                            "dataset"   : dataset,
                            "svm"       : False, 
                            "n_classes" : 10,
                            "id"        : 'data'
                      }

    visualizer_params = {
                    "root"       : 'lenet5',
                    "frequency"  : 1,
                    "sample_size": 144,
                    "rgb_filters": True,
                    "debug_functions" : False,
                    "debug_layers": False,  # Since we are on steroids this time, print everything.
                    "id"         : 'main'
                        }       

    # intitialize the network
    net = network(   borrow = True,
                     verbose = verbose )                       
    
    # or you can add modules after you create the net.
    net.add_module ( type = 'optimizer',
                     params = optimizer_params, 
                     verbose = verbose )

    net.add_module ( type = 'datastream', 
                     params = dataset_params,
                     verbose = verbose )

    net.add_module ( type = 'visualizer',
                     params = visualizer_params,
                     verbose = verbose 
                    )
    # add an input layer 
    net.add_layer ( type = "input",
                    id = "input",
                    verbose = verbose, 
                    datastream_origin = 'data', # if you didn't add a dataset module, now is 
                                                 # the time. 
                    mean_subtract = False )
    
    # add first convolutional layer
    net.add_layer ( type = "conv_pool",
                    origin = "input",
                    id = "conv_pool_1",
                    num_neurons = 20,
                    filter_size = (5,5),
                    pool_size = (2,2),
                    activation = 'maxout(2,2)',
                    # regularize = True,
                    verbose = verbose
                    )

    net.add_layer ( type = "conv_pool",
                    origin = "conv_pool_1",
                    id = "conv_pool_2",
                    num_neurons = 50,
                    filter_size = (3,3),
                    pool_size = (2,2),
                    activation = 'relu',
                    # regularize = True,
                    verbose = verbose
                    )      


    net.add_layer ( type = "dot_product",
                    origin = "conv_pool_2",
                    id = "dot_product_1",
                    num_neurons = 1250,
                    activation = 'relu',
                    # regularize = True,
                    verbose = verbose
                    )

    net.add_layer ( type = "dot_product",
                    origin = "dot_product_1",
                    id = "dot_product_2",
                    num_neurons = 1250,                    
                    activation = 'relu',  
                    # regularize = True,    
                    verbose = verbose
                    ) 
    
    net.add_layer ( type = "classifier",
                    id = "softmax",
                    origin = "dot_product_2",
                    num_classes = 10,
                    # regularize = True,
                    activation = 'softmax',
                    verbose = verbose
                    )

    net.add_layer ( type = "objective",
                    id = "obj",
                    origin = "softmax",
                    objective = "nll",
                    datastream_origin = 'data', 
                    regularization = regularization,                
                    verbose = verbose
                    )
                    
    learning_rates = (0.05, .0001, 0.001)  
    net.pretty_print()  
    # draw_network(net.graph, filename = 'lenet.png')   

    net.cook()

    net.train( epochs = (20, 20), 
               validate_after_epochs = 1,
               training_accuracy = True,
               learning_rates = learning_rates,               
               show_progress = True,
               early_terminate = True,
               patience = 2,
               verbose = verbose)

    print(net.test(verbose = verbose))
data = cook_mnist()
dataset = data.dataset_location()
lenet5 ( dataset, verbose = 2)


